Data exploration

Introduction

In a well-defined study, exploratory analysis may be largely unnecessary. Consider a scenario where the client has identified questions / hypotheses of interest and the additional variables that may predict or mediate the identified outcomes. An activity partner such as MSI was provided the time and resources to consult with stakeholders, scope out the target environment to identify and map the data generating process, and develop an inferential design by which the collected data and analytical routines would address the questions posed by the client. In this scenario, exploratory analysis may be entirely eliminated, and the time that may be spent on exploration is shifted to sensitivity analysis of the pre-identified analytical routines.

On the other hand, consider a scenario where data has been collected for one purpose, and later realized to have possible use with other, not yet identified, purposes. In this case, the initial analysis is entirely exploratory.

Data visualization

Data visualization tactics

geomtextpath

There are common situations where MSI is tasked with ongoing data collection and evaluation of a client activity. For example, the MENA MELS activity (2020-2024) was tasked with ongoing monitoring and evaluation of the Middle East Partnership for Peace Activity (MEPPA). MEPPA comprised grants to several local partners organized around the common motif of building bonds between different demographic groups. The primary outcome of interest was whether or not a participant in the grant activity reported a perceived increase in understanding the political, social, and economic situation and viewpoints of another group. Data were collected on a rolling basis across several implementing partners, across baseline and endline. That data are summarized as follows:

df1 <- read_csv("data/short demo series/meppa response items.csv",
                show_col_types=F)
df1 %>%
  flextable() %>% 
  autofit()

item

response

lab

endline

n

perc

Political views

1

Basic understanding

0

22

0.286

Political views

2

Fair understanding

0

38

0.494

Political views

3

High understanding

0

17

0.221

Political views

1

Basic understanding

1

18

0.173

Political views

2

Fair understanding

1

50

0.481

Political views

3

High understanding

1

36

0.346

Social views

1

Basic understanding

0

30

0.390

Social views

2

Fair understanding

0

25

0.325

Social views

3

High understanding

0

22

0.286

Social views

1

Basic understanding

1

15

0.147

Social views

2

Fair understanding

1

52

0.510

Social views

3

High understanding

1

35

0.343

Economic views

1

Basic understanding

0

32

0.416

Economic views

2

Fair understanding

0

30

0.390

Economic views

3

High understanding

0

15

0.195

Economic views

1

Basic understanding

1

20

0.196

Economic views

2

Fair understanding

1

55

0.539

Economic views

3

High understanding

1

27

0.265

For purposes of client reporting, there are three ordinal responses across baseline and endline, for each of three types of viewpoint. There is insufficient data to conduct inferential tests at this level of granularity, but this data may be visualized descriptively.

The geomtextpath package offers the functionality to directly label line-based plots with text that is able to follow a curved path. Simply replace ‘geom_line’ with ‘geom_textpath’ and assign the variable to use as the label. The following figures illustrate.

pol <- ggplot(filter(df1, item=="Political views"), aes(endline, perc, color=as.factor(response))) + 
  geom_textpath(aes(label=lab),
                size=4) +
  geom_label(aes(label=paste(round(perc*100,0), "%", sep="")),
             size=4,
             label.padding = unit(.14, "lines")) +
  scale_color_viridis_d(option="D") +
  scale_x_continuous(limits=c(-.1, 1.1),
                     breaks=c(0,1),
                     labels=c("Baseline","Endline")) +
  scale_y_continuous(labels=percent_format(accuracy=1)) +
  theme(legend.position="none",
        axis.text.y=element_blank()) +
  labs(x="",
       y="",
       title="Political situation") 

pol

soc <- ggplot(filter(df1, item=="Social views"), aes(endline, perc, color=as.factor(response))) + 
  geom_textpath(aes(label=lab),
                size=4) +
  geom_label(aes(label=paste(round(perc*100,0), "%", sep="")),
             size=4,
             label.padding = unit(.14, "lines")) +
  scale_color_viridis_d(option="D") +
  scale_x_continuous(limits=c(-.1, 1.1),
                     breaks=c(0,1),
                     labels=c("Baseline","Endline")) +
  scale_y_continuous(labels=percent_format(accuracy=1)) +
  theme(legend.position="none",
        axis.text.y=element_blank()) +
  labs(x="",
       y="",
       title="Social situation") 

soc

ec <- ggplot(filter(df1, item=="Economic views"), aes(endline, perc, color=as.factor(response))) + 
  geom_textpath(aes(label=lab),
                size=4) +
  geom_label(aes(label=paste(round(perc*100,0), "%", sep="")),
             size=4,
             label.padding = unit(.14, "lines")) +
  scale_color_viridis_d(option="D") +
  scale_x_continuous(limits=c(-.1, 1.1),
                     breaks=c(0,1),
                     labels=c("Baseline","Endline")) +
  scale_y_continuous(labels=percent_format(accuracy=1)) +
  theme(legend.position="none",
        axis.text.y=element_blank()) +
  labs(x="",
       y="",
       title="Economic situation") 

ec

Given that the three types of understanding of others’ situation are highly correlated, it makes sense to present these measures compactly as aspects of a deeper underlying construct. The patchwork library allows multiple ggplots to be assembled together as a single plot. The following figure illustrates.

pol + soc + ec + 
  plot_annotation(title="How well do you understand the situation of others?")

A final use of presenting the data more compactly is to collapse the ordinal responses to binary, and collect the three measures as lines in a single plot. The following data captures each type of understanding as either fair or high understanding as one category, and basic understanding as the other category.

dat <- read_csv("data/short demo series/meppa item ladder.csv",
                show_col_types=F)

dat_flx <- dat %>%
  flextable() %>%
  autofit() 

dat_flx

endline

n

perc

item

0

55

0.714

Political situation

1

86

0.827

Political situation

0

47

0.610

Social situation

1

87

0.853

Social situation

0

45

0.584

Economic situation

1

82

0.804

Economic situation

With this simplified data summary, the trendline for each type of understanding may now be collected in a single plot.

ggplot(dat, aes(endline, perc, color=as.factor(item))) + 
  geom_textpath(aes(label=item),
                size=4) +
  geom_label(aes(label=paste(round(perc*100,0), "%", sep="")),
             size=4,
             label.padding = unit(.14, "lines")) +
  scale_color_viridis_d(option="D") +
  scale_x_continuous(limits=c(-.1, 1.1),
                     breaks=c(0,1),
                     labels=c("Baseline","Endline")) +
  scale_y_continuous(labels=percent_format(accuracy=1),
                     breaks=c(.5,1),
                     sec.axis=dup_axis(breaks=c(.804,.827,.853),
                                       labels=c("+22","+12","+24"))) +
  theme(legend.position="none",
        axis.text.y.left=element_blank()) +
  labs(x="",
       y="",
       caption="Proportion reporting fair or high\nunderstanding of others' situation")

Note further the use of secondary axis breaks to illustrate the change score for each trendline from baseline to endline.

The R computing language allows for several ways to customize the use of labels in statistical or descriptive graphics. This short demo has illustrated MSI’s use of the geomtextpath package to place labels directly along the line or curve of a plot. This illustration used only straight lines between two points in time. For additional use cases of the geomtextpath package, see the package vignette.

Mapping

Much of our data is collected from surveys and has geographic coordinates associated with a point of collection, a house, or a city, etc. It can be useful to generate a map to get a sense of where the data is coming from. Mapping is a part of MSI’s exploratory data analysis process and is also used in developing a sampling plan and in producing high quality data vizualizations for clients. This section provides a brief introduction into mapping in R.

The most commonly used packages to handle spatial data are sf for vectors, terra for vectors and rasters, and raster for rasters. To visualize the data, two frequently used packages are tmap and ggplot2 packages.

To get started, we need to load our packages.

#run this line first if you have never used these packages before
#install.packages(c("tidyverse", "sf", "tmap", "readr", "here"))

library(tidyverse) #install the core tidyverse packages including ggplot2
library(sf) #provides tools to work with vector data 
library(tmap) #for visualizing spatial data
library(readr) #functions for reading external datasets 
library(here) #to better locate files not in working directory
library(geodata) #to download administrative boundaries

Read in the data

#It is a csv file so I use the read_csv function and provide the file path
cities <- read_csv(here::here("../methods corner/Map demo/data/Madagascar_Cities.csv")
                   , show_col_types = FALSE)

#Observe the first few rows of data
DT::datatable(head(cities))

Now, for the administrative boundaries. Each of these are being read in using gadm() from the geodata package and then converted to an sf object with the st_as_sf() function. In the example, we download only the country boundary, but if we wanted regions or departments, we would simply change the level argument inside the gadm() function call to a 1 or a 2.

Country boundary

#This is only the country boundary. 
mdg <- geodata::gadm(country = "MDG"
                  , level = 0
                  , path = tempdir()) |>
  st_as_sf()

Convert the cities to an sf object

Remember that the cities object is a standard .csv with longitude and latitude columns, but it is not yet recognized as an sf object that has geographic properties. Here is how to convert it to an sf object with a single geometry column and a crs.

cities_sf <- cities |>
  st_as_sf(coords = c("Longitude", "Latitude")
           , crs = 4326)

#observe the first few rows of data
DT::datatable(head(cities_sf))

Make the map

The following code chunks and tabs walk through the process of making and improving a map in both tmap and ggplot2. In the example, cities are what we are plotting, but we could be plotting any variable of a dataset.

tmap_mode("plot") +
  tm_shape(mdg) +
  tm_polygons() + #for only the borders, use tm_borders()
  tm_shape(cities_sf) +
  tm_dots(size = .25, col = "red")

ggplot2::ggplot(mdg) +
  geom_sf() +
  geom_sf(data = cities_sf, color = "red")

Make the map better

#the city names are long so we have to 
# make a bigger window to fit them. This isn't part of the normal process
#make an object with the current bounding box
bbox_new <- st_bbox(mdg)

#calculate the x and y ranges of the bbox
xrange <- bbox_new$xmax - bbox_new$xmin # range of x values
yrange <- bbox_new$ymax - bbox_new$ymin # range of y values

#provide the new values for the 4 corners of the bbox
  bbox_new[1] <- bbox_new[1] - (0.7 * xrange) # xmin - left
  bbox_new[3] <- bbox_new[3] + (0.75 * xrange) # xmax - right
  bbox_new[2] <- bbox_new[2] - (0.1 * yrange) # ymin - bottom
  bbox_new[4] <- bbox_new[4] + (0.1 * yrange) # ymax - top

#convert the bbox to a sf collection (sfc)
bbox_new <- bbox_new |>  # take the bounding box ...
  st_as_sfc() # ... and make it a sf polygon

#now plot the map
tmap_mode("plot") +
  tm_shape(mdg, bbox = bbox_new) +
  tm_polygons() +
  tm_shape(cities_sf) +
  tm_dots(size = .25, col = "red") +
  tm_text(text = "Name", auto.placement = T) +
  tm_layout(title = "Main Cities of\nMadagascar")

#the city names are long so we have to 
# make a bigger window to fit them. This isn't part of the normal process
#make an object with the current bounding box
bbox_new <- st_bbox(mdg)

#calculate the x and y ranges of the bbox
xrange <- bbox_new$xmax - bbox_new$xmin # range of x values
yrange <- bbox_new$ymax - bbox_new$ymin # range of y values

#provide the new values for the 4 corners of the bbox
  bbox_new[1] <- bbox_new[1] - (0.5 * xrange) # xmin - left
  bbox_new[3] <- bbox_new[3] + (0.5 * xrange) # xmax - right
  bbox_new[2] <- bbox_new[2] - (0.1 * yrange) # ymin - bottom
  bbox_new[4] <- bbox_new[4] + (0.1 * yrange) # ymax - top

#convert the bbox to a sf collection (sfc)
bbox_new <- bbox_new |>  # take the bounding box
  st_as_sfc() # ... and make it a sf polygon


ggplot2::ggplot() +
  geom_sf(data = mdg) +
  geom_sf(data = cities_sf, color = "red") +
  ggrepel::geom_text_repel(data = cities_sf
               , aes(label = Name
                     , geometry = geometry)
               , stat = "sf_coordinates"
               , min.segment.length = 0) +
  coord_sf(xlim = st_coordinates(bbox_new)[c(1,2),1], # min & max of x values
           ylim = st_coordinates(bbox_new)[c(2,3),2]) + # min & max of y values +
  labs(title = "Main Cities of\nMadagascar") +
  theme_void()

Final touches

Now that we have a map with cities plotted (we achieved our goal!), we will add a few finishing touches and set the size of the city points to the population variable in the original dataset.

Additionally, tmap provides a simple interface to go from a static map to an interative map simply by changing tmap_mode("plot") to tmap_mode("view").

#the city names are long so we have to 
# make a bigger window to fit them. This isn't part of the normal process
#make an object with the current bounding box
bbox_new <- st_bbox(mdg)

#calculate the x and y ranges of the bbox
xrange <- bbox_new$xmax - bbox_new$xmin # range of x values
yrange <- bbox_new$ymax - bbox_new$ymin # range of y values

#provide the new values for the 4 corners of the bbox
  bbox_new[1] <- bbox_new[1] - (0.7 * xrange) # xmin - left
  bbox_new[3] <- bbox_new[3] + (0.75 * xrange) # xmax - right
  bbox_new[2] <- bbox_new[2] - (0.1 * yrange) # ymin - bottom
  bbox_new[4] <- bbox_new[4] + (0.1 * yrange) # ymax - top

#convert the bbox to a sf collection (sfc)
bbox_new <- bbox_new |>  # take the bounding box ...
  st_as_sfc() # ... and make it a sf polygon
tmap_mode("plot") +
  tm_shape(mdg, bbox = bbox_new) +
  tm_polygons() +
  tm_shape(cities_sf) +
  tm_dots(size = "Population", col = "red"
          , legend.size.is.portrait = TRUE) +
  tm_text(text = "Name", auto.placement = T
          , along.lines = T) +
  tm_scale_bar(position = c("left", "bottom"), width = 0.15) +
  tm_compass(type = "4star"
             , position = c("right", "bottom")
             , size = 2) +
  tm_layout(main.title = "Main Cities of Madagascar"                         , legend.outside = TRUE)

ggplot2::ggplot() +
  geom_sf(data = mdg) +
  geom_sf(data = cities_sf, aes(size = Population)
          , color = "red") +
  ggrepel::geom_text_repel(data = cities_sf
               , aes(label = Name
                     , geometry = geometry)
               , stat = "sf_coordinates"
               , min.segment.length = 0) +
  coord_sf(xlim = st_coordinates(bbox_new)[c(1,2),1], # min & max of x values
           ylim = st_coordinates(bbox_new)[c(2,3),2]) + # min & max of y values +
  ggspatial::annotation_scale(location = "bl") +
  ggspatial::annotation_north_arrow(location = "br"
                                    , which_north = "true"
                                    , size = 1)+
  labs(title = "Main Cities of Madagascar") +
  theme_void()

tmap_mode("view") +
  tm_shape(mdg) +
  tm_borders() +
  tm_shape(cities_sf) +
  tm_dots(size = "Population", col = "red"
          , legend.size.is.portrait = TRUE) +
  tm_text(text = "Name", auto.placement = T
          , along.lines = T) +
  tm_scale_bar(position = c("left", "bottom"), width = 0.15) +
  tm_compass(type = "4star"
             , position = c("right", "bottom")
             , size = 2) +
  tm_layout(main.title = "Main Cities of Madagascar"                         , legend.outside = TRUE)

Additional Resources

For those interested in mapping in R (or QGIS) there are many free resources available online. A great starting point for R is the online text book, Geocomputation with R. If you would rather learn more in Python, Geocomputation with Python is a great resource.